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Colon and rectum cancer share many risk factors, and are often 
tabulated together as "colorectal cancer" in published summaries. 
However, recent work indicating that exercise, diet, and family his- 
tory may have differential impacts on the two cancers encourages 
analyzing them separately, so that corresponding public health in- 
terventions can be more efficiently targeted. We analyze colon and 
rectum cancer data from the Minnesota Cancer Surveillance Sys- 
tem from 1998-2002 over the 16-county Twin Cities (Minneapolis-St. 
Paul) metro and exurban area. The data consist of two marked point 
patterns, meaning that any statistical model must account for ran- 
domness in the observed locations, and expected positive association 
between the two cancer patterns. Our model extends marked spatial 
point pattern analysis in the context of a log Gaussian Cox process to 
accommodate spatially referenced covariates (local poverty rate and 
location within the metro area), individual- level risk factors (patient 
age and cancer stage), and related interactions. We obtain smoothed 
maps of marginal log-relative intensity surfaces for colon and rectum 
cancer, and uncover significant age and stage differences between the 
two groups. This encourages more aggressive colon cancer screening 
in the inner Twin Cities and their southern and western exurbs, where 
our model indicates higher colon cancer relative intensity. 

1. Introduction. 

1.1. Etiologies of colon and rectum cancer. Traditionally, public health 
agencies have reported colon and rectum cancers together under the title 
"colorectal cancer." Since the turn of the last century, however, an active 
debate has emerged regarding whether these two cancers really have suffi- 
ciently similar etiologies to be aggregated in this way. Some experts have 
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argued that the cancers should be reported and monitored separately, so that 
public health interventions can be more sensibly and efficiently targeted. 

A variety of epidemiological studies have indicated variables that may 
have a differential impact on colon and rectum cancer. These variables fall 
under three broad categories. The first is exercise. A very recent study by 
the Physical Activity Guidelines Advisory Committee (2008) identified 23 
publications on this general topic, arising from 12 prospective cohort studies 
and 8 case-control studies. These studies show a consistent inverse relation 
between physical activity and colon cancer risk, with this relation being sta- 
tistically significant for at least one physical activity domain and one sex in 
9 of the 12 cohort studies and 5 of the 8 case-control studies. More specif- 
ically, the median relative risk (RR) comparing most- versus least-active 
subjects was 0.7 over all the studies. The advisory committee stated that 
this overall finding was unlikely to be the result of confounding, since the 
studies for the most part included relevant covariates, such as body mass in- 
dex (BMI), smoking, alcohol, diet, screening, menopausal status, and family 
history of colon cancer. By contrast, the committee found the studies to in- 
dicate no apparent relationship between physical activity and rectal cancer 
risk. Specifically, more than half the studies showed no statistically signifi- 
cant associations, and the median RR over all the studies was 1.0. The fact 
that moderate-to-vigorous physical activity (say, 30 to 60 minutes per day) 
may be protective against colon cancer but not rectal cancer suggests that 
these two cancers should be treated separately in cancer registry reporting 
and subsequent statistical modeling. 

A second broad area that may have a differential impact on the two can- 
cers is diet. Diet has long been suspected as an etiological factor for colorectal 
cancer; however, studies of individual foods and nutrients have often pro- 
vided inconsistent results, perhaps due to low statistical power. Flood et al. 
(2008) address this problem using factor analysis to group dietary variables 
into three broad groups, and go on to conclude that lower consumption of 
meat and potatoes, and higher consumption of fruit, vegetables, and fat- 
reduced foods, are associated with reduced colorectal cancer risk. Wei et al. 
(2003) use data from two prospective cohort studies (87,733 women from 
the Nurses' Health Study and 46,632 men from Health Professionals Follow- 
Up Study) to investigate the effect of dietary variables on colon and rectum 
cancer separately. In the combined cohort, a variety of variables emerge as 
significant predictors of elevated colon cancer risk, including intake of beef, 
pork or lamb as a main dish, intake of processed meat, and alcohol con- 
sumption. However, none of these variables emerge as predictors of rectal 
cancer. Using data from the Iowa Women's Health Study, Folsom and Hong 
(2005) showed that magnesium and calcium intake were independently as- 
sociated with significantly lower colon cancer risk, but not rectum cancer 
risk. The relative risk estimates changed little across baseline subgroups, 
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such as women who did or did not use hormone replacement therapy, or 
were or were not diabetic. Using data from a different study, Flood et al. 
(2005) conclude that the protective effect of calcium is present regardless of 
whether the calcium arises naturally in food, or is delivered through dietary 
supplements. Finally, Pedersen, Johansen and Gronbaek (2003) observed a 
dose-response relationship between alcohol and rectal cancer in a Danish 
cohort of 15,491 men and 13,641 women who did not include wine in their 
alcohol intake. However, no association between alcohol and colon cancer 
was found. 

The third broad area where the etiologies of colon and rectal cancer appear 
to differ is family history. For those who reported a family history of colon 
or rectal cancer, Fuchs et al. (1994) obtained a RR of 1.99 for colon cancer 
but just 0.86 for rectal cancer, a statistically significant difference based on a 
simple chi-square test with one degree of freedom. Wei et al. (2003) drew the 
same conclusion, but using a different dataset and a stepwise polytomous 
logistic regression procedure. 

1.2. MCSS data and problem description. Our specific problem of inter- 
est involves the comparison of the spatial distributions of colon and rectum 
cancer patients in the state of Minnesota. These data are collected by the 
Minnesota Cancer Surveillance System (MCSS), a program sponsored by 
the Minnesota Department of Health. The MCSS includes the residential 
address of essentially every person diagnosed with cancer in Minnesota. Here 
we consider the subset of patients diagnosed during the period 1998-2002 
(an interval chosen partly for its centering around a U.S. Census year, 2000). 
Figure 1 shows the 7 counties comprising the Twin Cities metro area as 
those encircled by the dark boundary; also shown are 9 adjacent, exurban 
counties. Within these 16 counties, we have 6544 individuals for analysis. 
Figure 1 plots the approximate locations of the cancers after the addition 
of a random "jitter" to protect patient confidentiality (explaining why some 
of the cases appear to lie outside of the spatial domain) . The physiological 
adjacency of the colon and the rectum suggests positive dependence in these 
point patterns; persons with rectum cancer beyond stage 1 (i.e., regional or 
distant) are at risk for colon cancer due to metastasis. Moreover, the two 
cancers likely share unmodeled spatially-varying risk factors (such as local 
health care quality or availability), also implying positive dependence. This 
may help health care providers or public health policy makers to identify 
regions of excessive risk requiring intervention (say, a direct mail campaign 
encouraging more aggressive screening) or other weak links in the health 
care system. 

The causes of colon and rectum cancer are unknown. Age is the primary 
risk factor, with disease incidence increasing significantly after the age of 50. 
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As already mentioned, family medical history may also be helpful in predict- 
ing colon and rectum cancer risk, along with several lifestyle factors such 
as alcohol use, smoking, diet, and exercise. Unfortunately we do not have 
access to this information for individuals in the MCSS, but the lifestyle fac- 
tors could reasonably be expected to cluster spatially due to corresponding 
sociodemographic clustering. We also have census tract-level poverty rates, 
which should be correlated with these risk factors. 

A full analysis of the data in Figure 1 would account for the randomness 
in the observed locations, their spatial correlation, important covariates (in- 
cluding population density) , and any other hierarchical structure in the data 
(such as the tendency of model residuals to cluster spatially). The output of 
such an analysis would include maps of the fitted adjusted log- intensity sur- 
face, point and interval estimates for important main effects and interactions 




Fig. 1. Jittered residential locations of colon (light circle) and rectum (dark circle) cancer 
cases, Twin Cities metro and exurban counties, 1998-2002. 
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(e.g., location-age), and perhaps maps of fitted spatial residual surfaces, to 
help identify spatial covariates still missing from the model. 

1.3. Statistical modeling of spatial point patterns. In spatial disease map- 
ping settings, the primary goals are typically to investigate the connections 
between the disease and (possibly geographically-indexed) covariates, to 
characterize the spatial variation of the disease occurrence, and to iden- 
tify areas having elevated disease risk. In such cases, the data are often ag- 
gregated to counts within specified areal regions (counties, zip codes, etc.). 
Indeed, most published statistical analyses to date of data of this type use so- 
called areal or lattice models; see, for example, Banerjee, Carlin and Gelfand 
[(2004), Chapter 3 and Section 5.4] for a review. However, if precise geocoded 
locations of disease cases are available, it is more appealing to study the 
resulting spatial point pattern using spatial point process modeling. How- 
ever, such methods are conceptually and computationally more challenging, 
and are implemented in fewer widely available statistical software programs. 
Indeed, even when actual geocoded locations are available, a standard com- 
putational strategy is to partition the study region and model the counts in 
each cell of the partition as conditionally independent Poisson observations, 
obtaining the standard areal model but with an arbitrary partition. 

Under a nonhomogeneous Poisson process, the likelihood for the intensity 
surface generating the locations given the observed locations is well known 
[see, e.g., Benes et al. (2002); Diggle (2003); M0ller and Waagepetersen 
(2003)]. We begin with this likelihood, but introduce the following features. 
First, we accommodate covariate information in a novel way. We envision 
certain covariates as conditional, that is, we seek to compare point patterns 
given levels of these covariates. For us, these are cancer type covariates which 
"mark" the point pattern. We view other patient level characteristics (or risk 
factors) such as age as nuisance variables for which we wish to adjust. We 
then model point patterns jointly over geographic space and nuisance covari- 
ate space, enabling the notions of both conditional and marginal intensity 
associated with geographic space. Hence, we obtain an intensity adjusted 
for these covariates, rather than an intensity which ignores them by not in- 
cluding them in the model. Moreover, we also have purely spatial covariates, 
some of which are available at areal unit level (say, county-level features), 
while others may be available at point level (say, distance from a location 
to the nearest cancer screening facility). Employing spatial information at 
both scales precludes aggregation of points to counts. 

Additionally, we anticipate dependence between the intensity surfaces as- 
sociated with the two cancers, since, for example, an excess of colon cancer 
in a portion of the study region may suggest correspondingly high levels 
of rectum cancer. We capture this dependence using multivariate process 
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realizations for the intensities. Last, working with the above point level like- 
lihood, as well as fairly large numbers of points (e.g., order 10 3 ), necessitates 
approximation to implement the model-fitting. 

The analysis of spatial point patterns has a reasonably long history in 
the literature, initially built using exploratory tools such as distance based 
methods yielding F functions, G functions, and, perhaps most commonly, 
Ripley's K function. All are based upon assessing departure from complete 
spatial randomness (CSR), which is interpreted as a homogeneous Poisson 
process and for which closed forms for these functions exist. However, no like- 
lihood is specified, and comparison between point patterns is not possible. 
Another more recent approach involves the use of spatial scan statistics, cur- 
rently popular in large part due to the SatScan software of Kulldorff (2006). 
But again, no likelihood is specified so inference is limited to say detection 
of "hot spots." 

To achieve the foregoing objectives, we instead adopt a model-based fo- 
cus, and write the intensity of the process as A(s), where s €T> for some 
spatial domain T>. For a collection of observed cancer case locations Si,i = 
1, . . . , n, we work with the likelihood L(A(s), s € D; {sj}™ =1 ) which takes the 

form e~$o A ( s ) ds fj™ =1 A(s^). Often, A(s) is specified as a parametric func- 
tion, for example, using a basis representation or a tiled surface. Adding a 
prior distribution on these parameters, say, 9, yields a posterior distribution 
p(A(s; #)|{sj}) for making inferences about the intensity surface. 

For us, A(s) is thought of as a log Gaussian process (GP) realization, 
resulting in the familiar class of Cox processes [M0ller and Waagepetersen 
(2004), page 57]. To specify this prior distribution, we require /x(s), the 
mean surface, along with a 2 and tf>, the GP covariance parameters. Below, 
we express /x(s) in part with a form z'(s)(3, so that the process mean can 
depend on spatially referenced covariates z(s). 

A common class of estimation methods for inhomogeneous spatial point 
process models avoids full likelihood evaluations by formulating estimating 
equations [Waagepetersen (2007); Waagepetersen and Guan (2009)]. Guan 
and Loh (2007) study the distributional properties of the estimation proce- 
dure of Waagepetersen (2007), and obtain variance estimates using a thinned 
block bootstrap procedure. Guan (2006) developed a composite likelihood 
method based on the second-order intensity function of the underlying pro- 
cess. Diggle and Rowlingson (1994) handle bivariate (case-control) point 
processes via a conditional likelihood approach to convert the two spatial 
point process models into an easier-to-fit nonlinear binary regression model. 
Similarly, Guan, Waagepetersen and Beale (2008) estimate correlation func- 
tions via either a consistent nonparametric kernel smoothing estimator, or 
a parametric conditional likelihood estimator. In all of these approaches, 
inference on spatial associations and second-order variations proceeds not 
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from the intensity surface, but from pairwise correlation functions and trans- 
forms thereof [e.g., the g and K functions in Waagepetersen (2007)]. As such, 
they do not offer direct attacks on the intensity surface estimation problem, 
needed for inference regarding the fitted surface itself, its rate of change 
at any point [as needed for spatial boundary analysis or "wombling"; see 
Banerjee and Gelfand (2006), and Liang, Banerjee and Carlin (2009)], or 
model-based comparison of the surfaces for colon and rectum cancer. 

As such, we instead adopt a fully Bayesian approach that yields posterior 
distributions for the intensity surface, or even the spatial residual surface 
after adjusting for regressors that are allowed to differ for the two cancers. 
Due to the absence of sufficient covariate (e.g., diet) information in our 
dataset, we introduce spatially varying random effects which we view as 
surrogates for these missing covariates. Inference is exact and does not rely 
upon possibly inappropriate use of infill or increasing-domain asymptotics. 
However, our more comprehensive approach comes with a price. Specifically, 
note that if A(s) is modeled as a random realization of a spatial process, then 
the likelihood integral is stochastic, precluding explicit evaluation. Indeed, 
a variety of computational challenges emerge in working with the point- 
level likelihood in this case: the stochastic integration, the large collection 
of spatial locations, and a prior specification that is only available through 
finite dimensional distributions. 

Wolpert and Ickstadt (1998) offered one of the first fully Bayesian ap- 
proaches for spatially nonhomogeneous Poisson process data. Benes et al. 
(2002) illustrate one possible Bayesian analysis of a log Gaussian Cox pro- 
cess model. While they assume A(s) constant over grid cells, they do utilize 
the notion of the population intensity surface, and obtain fitted disease maps 
under a variety of models (constant, Gaussian kernel, etc.) for this surface. 
However, they do not consider joint modeling of multiple disease surfaces, 
nor covariates that are not location-specific (e.g., the age or cancer stage of 
a case observed at a particular location) . 

In this paper we employ novel spatial point process approaches that ac- 
count for both location-specific and nonlocation-specific covariates in the 
context of multiple dependent point processes to analyze the MCSS dataset. 
We begin in Section 2 with a brief review of spatial point process modeling. 
We then go on to present a set of multivariate spatial point process models 
for investigating the effect of both location-specific and nonlocation-specific 
covariates, as well as their interactions. Section 3 then gives the results of 
applying them to our MCSS dataset. Finally, Section 4 discusses our find- 
ings and offers directions for future research in this area. Computational 
challenges in fitting our models are addressed in the supplemental article 
[Liang, Carlin and Gelfand (2009)]. 
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2. Hierarchical modeling for spatial point processes. 

2.1. Modeling with spatial covariates. We begin with a brief review of 
the basics of log Gaussian Cox process modeling. Consider a set of ran- 
dom locations which we denote by S = {sj}™ =1 where disease occurrence is 
observed over a spatial domain D. We model this random set of locations 
using a nonhomogeneous Poisson process with intensity function A(s) for all 
seD. Let z(s) be a vector of location-specific covariates corresponding to a 
disease case observed at s. For us, a key component of z(s) is the indicator 
of whether the case is in the metro area or not. However, in other contexts, 
we could envision information such as elevation, climate, exposure to pol- 
lutants, and so on to be relevant. We model A(s) = r(s)7r(s), where r(s) 
is the population density surface at location s. In practice, we may create 
such a surface using GIS tools and census data, or we may just work with 
areal unit population counts, letting r(s) = 7T,(A) /| v4| if s G A, where n(A) 
is the number of persons residing in A and \A\ is the area of A. The error 
introduced by this admittedly crude estimate may be mitigated somewhat 
by resorting to the non-Bayesian estimating equation alternatives discussed 
in Section 1.3. Specifically, in this framework one could model the two point 
processes separately [Waagepetersen (2007)] or jointly [by an extension of 



Returning to our framework, r(s) serves as an offset and 7r(s) is inter- 
preted as a population adjusted (or relative) intensity, which we model on 
the log scale as 



where w(s) is a zero-centered stochastic process, and (3 is an unknown vector 
of regression coefficients. If w(s) is taken to be a Gaussian process, then the 
original point process is called a log Gaussian Cox process [LGCP; M0ller 
and Waagepetersen (2004), page 72]. The likelihood associated with (3 and 
W D = {^( s ) : s ^ D} given S takes the form 



Operating formally, a prior on wd along with a prior on (3 completes the 
Bayesian specification. Inference proceeds from the posterior which, again 
formally, is p{f3, wd\S) oc L(j3, wd\ S)p((3)p{w d) ■ Of course, the Gaussian 
process is only defined through its finite dimensional distributions so that, 
practically, this posterior is viewed in terms of a finite collection of loca- 
tions. This motivates discrete approximation of the stochastic integral as we 
discuss below. One discrete approximation partitions D into a collection of 
sets (say, Ai,i = 1,2, . . . ,m) and creates a Poisson likelihood for the counts 



Guan (2006)]. 
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given X(Ai). That is, it models log7r(^4j), thus precluding use of point level 
covariate information. Moreover, since ir(Ai) = J a /k(s) ^ exp(/ A . (z(s)'/3 + 
w(s))ds), it is inappropriate to utilize the latter, simpler integration. In- 
deed, ignoring this inequality can introduce ecological fallacy issues; see, for 
example, Wakefield and Salway (2001) for a discussion. 

We pursue an alternative discrete approximation which still enables us 
to work at the point level. Suppose we replace J D A(s) with some choice 
of numerical integration. For the moment, we allow analytic possibilities as 
well as Monte Carlo versions, since in either case, we will end up replacing 
Wd with a finite set, say, w* = {w(sj),j = 1,2,... ,T}. Then we revise (2) to 

(3) L(/3, w*,w(s 1 ), w(s n );S)p(w* , io(si), . . . , w(s n ))p((3). 

Now, we only need to work with an (n + T)-dimensional random variable to 
handle the io's, hence, their prior is just an (n + T) -dimensional multivariate 
normal distribution. Note that, in (3), we will require that z(s) be available 
at each tjj that is, we require the component z(s) surfaces over D. These 
surfaces are not viewed as random and may be interpolated or tiled, accord- 
ing to the nature of the information for the particular spatial covariate; we 
assume only that we can assign a value of z for each s € D. 

2.2. Introducing nonspatial covariate information. So far we have indi- 
cated how to incorporate covariates that are spatially referenced into the 
modeling. In our setting, we seek to introduce nonspatial covariates which 
we think of as being of two types (though the distinction will depend upon 
the application). One type of covariate provides the "marks" leading to a 
marked point process model. For us, this covariate is cancer type (colon vs. 
rectum), and we are interested in whether the two cancer intensity patterns 
differ. 

The second type of covariate we view as an "auxiliary" variable that 
provides additional information associated with intensity. For us, age and 
cancer stage are examples of such covariates. Clearly patient age is associated 
with cancer intensity, but the strength of this association may differ across 
cancers. We wish to adjust intensity to reflect patient age, analogous to the 
age standardization used in aggregated areal data settings. 

In general, we view these latter covariates as continuous 2 and introduce 
a second argument into the definition of the intensity, yielding a surface in 
s and v over the product space D xV (i.e., geographic space by covariate 
space). We then generalize (1) to 

(4) vr(s,v) = exp(/3 + z(s)'/3 + v'a + (v <g> z(s)) ; 7 + io(s)), 



2 In the case of a discrete valued covariate, any integrals over v in our development are 
replaced by sums. 
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where the Kronecker product v (g> z(s) denotes the set of all the first order 
multiplicative interaction terms between z(s) and v. When a particular in- 
teraction term is not of interest, the corresponding coefficient in -f is set to 
zero. This expression envisions a conceptual intensity value at each (s,v) 
combination. The interaction terms between spatial and nonspatial covari- 
ates provide the ability to adjust the spatial intensity by individual risk 
factors. If we fix v in (4), we can view A(s, v) = r(s)7r(s, v) as a "condi- 
tional" intensity at level v. If we integrate over v (see below), we obtain the 
(cumulative) marginal intensity A(s) associated with 7r(s,v). 

Now, introducing marks k = 1, 2, . . . , K, a general additive form for the 
log relative intensity is 

(5) logvr fc (s, v) = f3 0k + z'(s)(3 k + v'a k + (v® z(s))7 fc + w k (s). 

We can immediately interpret the terms on the right side of (5). The global 
mark effect is captured with the 0Q k . Therefore, there is no intercept in z(s) 
and we have mark- varying coefficients for the spatially-referenced covariates, 
reflecting the possibility that these covariates can differentially affect the in- 
tensity surfaces of the marks. Similarly, we have mark-varying coefficients 
for the nuisance variables. We also have mark- varying coefficients for the 
interaction terms, reflecting possibly different effects of the nonspatial co- 
variates over spatial domains. Finally, we allow the spatial random effects to 
vary with mark, that is, a different Gaussian process realization for each k. 
Dependence in the w k (s) surfaces may be expected (say, increased intensity 
at s for one marked outcome encourages increased intensity for another at 
that s) , suggesting the need for a multivariate Gaussian process over the w k . 
Both separable and nonseparable forms for the associated cross-covariance 
function are conveniently specified through coregionalization [Gelfand et al. 
(2004); Banerjee, Carlin and Gelfand (2004), Sections 7.1 and 7.2]. 

Reduced models of (5) are immediately available, including, for exam- 
ple, w k (s) = w(s), f3 k = (3, and a k = a. Another interesting reduced model 
obtains by setting j k = 0, leading to 

(6) logvr fc (s,v) = (3 0k + z'(s)(3 k + v'a k + w k (s). 

This separable form enables us to directly study the effect of the marks on 
spatial intensity. Specifically, the intensity associated with (5) is 

(7) Xk(s,v) = exp(/3 fc + v'(x k ) xr(s)exp(z'(s)/3 fc + w fc (s)). 

We see a factorization into nonspatial nuisance and spatial covariate terms. 
Presuming the former is integrable over v, the latter, up to a constant, is 
the "marginal spatial intensity." 

Integration of Afc(s,v), based upon (5), can be computed analytically 
in most cases. When v is categorical, the likelihood integral involves only 
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integration over the spatial domain D. When v is continuous, simple algebra 
shows 

/ A fc (s, v) dv ds = r(s) exp((3 0k + z(s)'/3 fc + w k (s)) 
Jv 

x / exp(v'a fc + (v(g)z(s)) / 7 fc )dv. 
Jv 

Suppose, for instance, that there is only one component in z(s) and one 
component in v having range (vi,v u ). Provided a k + z{s)~f k 7^ 0, the marginal 
intensity Afc(s) is then 

Afc(s, v) dv ds 

= r(s) exp(/3 fc + (3 k z(s) + w k {s)) x / exp(v(a k + z(s)7 fe )) dv 

Jv 

= r(s) exp((3 0k + (3 k z(s) + w k (s)) 
1 



[exp(v u (a k + «(s)7 fc )) - exp(u z (a fc + z(s)7 fc ))]. 



«fc + z(s)j k 

Turning to the revised likelihood associated with (5), let {(s^j, v k i),i = 
1,2, ... , n k } be the locations and nuisance covariates associated with the n k 
points having mark k. The likelihood becomes 

(8) Y[exp(- J^J^\ k (s,v)dvds^J x J| X k(ski,v ki ). 

Using the calculations above, the double integral becomes 
Afe(s, v) dv ds 



D JV 



I) 



r(s) exp(/3 0k + k z(s) + w k (s)) 
1 



-[exp(v u (a k + z(s)j k )) - exp(vi(a k + z(s)j k ))} ds, 
a k + z(s)j k J 

provided that the set {s : a k + z(s)'y k = 0} has Lebesgue measure zero. Hence, 
the difficulty in the likelihood evaluation is the same as in (2) and can be 
treated in the manner described in conjunction with (3). In this regard, note 
that we bound the components of v in order to integrate explicitly over v. 
We do not have a stochastic integration with regard to V as we have over 
D. Of course, sensitivity to the chosen bounds should be investigated. 

Last, in the case of k = 2 marks, a common alternative model specifica- 
tion is logistic regression, which views the mark as the response given the 
locations and covariates. This is conditioning in the opposite order from our 
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model, which views the locations and covariates as random given the marks. 
In our dataset it seems more natural to compare point patterns for the two 
different types of cancer, rather than view cancer type as some sort of binary 
"response" to covariate information. 

3. Results. We now present the results of our analysis of the MCSS colon 
and rectum cancer data. Previous studies suggest that covariates related to 
a patient's socioeconomic status (SES) may be related to the patient's risk 
factors through its impact on diet, health care quality, or propensity to seek 
care. While our dataset lacks any individual-level SES measures, from cen- 
sus data we have several related tract-level variables: percentage of farm 
population, percentage of rural population, percentage of people with less 
than high school education, percentage of minority population, and poverty 
rate. A preliminary population-adjusted nonspatial Poisson regression anal- 
ysis of our data on these covariates revealed only poverty rate and the metro 
indicator as significant predictors. 

In our initial model, we consider two location-specific covariates: Zi(s), 
the metro area indicator, and Z2(s), the poverty rate in the census tract 
containing s. We also employ two nonlocation-specific covariates: v\, cancer 
stage [set to 1 if the cancer is diagnosed "late" (regional or distant stage) 
and otherwise], and V2, the patient's age at diagnosis. The population 
density r(s) we use for standardization is available at 2000 census tract 
level, meaning that we assume population density is constant across any 
tract. The integral of the intensity is approximated by a Monte Carlo sum 
using a predictive process approximation [Banerjee et al. (2008)]; see the 
supplemental article by Liang, Carlin and Gelfand (2009) for full details. 

The left and middle columns of Figure 2 show maps of the raw mean 
nonspatially varying covariates (age and proportion diagnosed late), while 
the right column maps a crude estimate of relative intensity for colon can- 
cer (top row) and rectum cancer (bottom row). Notice these summaries are 
presented at tract level, even though we have exact (or nearly exact) spa- 
tial coordinates here. In the first two columns, tracts containing no cases 
are simply shaded according to the overall observed mean values for each 
disease, which are 69.9 and -64.8 for age and 0.618 and 0.555 for propor- 
tion diagnosed late for colon and rectum cancer, respectively. None of these 
four maps show strong spatial patterns, though we do see several areas with 
higher than average age, late diagnosis fraction, or both. The right column 
maps the logs of the numbers of cases divided by total number of residents 
in each tract. These crude maps of the tract-level log relative intensity (un- 
adjusted for any spatial or nonspatial covariates) show somewhat stronger 
spatial patterns and higher overall rates of colon cancer. The rectum can- 
cer map features an interesting collection of low outlying values in several 
outer-ring suburban census tracts. 
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Table 1 

Table of colorectum cancer patients ' characteristics in metro and adjacent area of 
Minnesota. Percentages across appropriate columns are given m parentheses, and 
"ratio" gives the ratio of colon to rectum cases 





Total 


Late = 


Late = 1 


Metro 


Nonmetro 


All 


6544 


2606 (40%) 


3938 (60.2%) 


5481 (83.8%) 


1063 (16.2%) 


Colon 


4857 


1855 (38%) 


3002 (61.8%) 


4079 (84%) 


778 (16%) 


Rectum 


1687 


751 (44.5%) 


751 (55.5%) 


1402 (83.1%) 


285 (16.9%) 


Ratio 


2.88 


2.47 


4.0 


2.91 


2.73 



Table 1 breaks down the data by stage and metro/nonmetro area. We see 
that 38% of colon cancer cases were diagnosed at an early stage, while 44.5% 
of rectum cancer cases were. In total, colon cancer is nearly three times as 
prevalent as rectum cancer in both the metro and nonmetro areas. A fact 
not revealed by the table is that there are 72 individuals who contribute both 
a colon and a rectum tumor. Since this is only around 1% of the total of 6544 
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Table 2 

Model comparison using effective model size po and DIC score. GLM refers to 
generalized linear model having no random effects 



Model 


Pd 


DIC 


GLM (no residuals) 


11.8 


1194.4 


Univariate spatial residuals 


72.0 


692.4 


Bivariate spatial residuals 


80.2 


688.8 



individuals, we do not explicitly model this particular kind of dependence, 
but rather "lump it in" with the bivariate dependence modeled by p. 

Figure 3 shows tract-level maps of population density, r(s), and our two 
location-specific covariates, Z\(s) and Z2 (s). Not surprisingly, the central 
metro areas are the most populated. The poverty rate is fairly uniform except 
for high rates in a concentrated portion of the central metro. 

We now fit our model, using independent Inverse Gamma(2, 0.5) priors 
for o\ and erf, and a Unif(— 0.999, 0.999) prior for p. The scale of the spatial 
decay parameter <f> is determined by the distance function employed. In this 
application, we started with a Unif (130, 390) prior for <f), so that the effective 
range lies between one-fourth and three-fourths of the maximal distance 
between any two knots. As expected, <fi is only weakly identified, so a fairly 
informative prior is needed for satisfactory MCMC behavior. For simplicity, 
we simply fix the range parameter at <p = 195, so that the effective range is 
roughly half of the maximal distance. A random-walk Metropolis-Hastings 
algorithm is used to draw posterior samples. 

Table 2 compares the effective model size and DIC score of three models. 
It can be seen that the no-random effect model (GLM) is unacceptably bad, 
and the model with a single set of spatial residuals is not much worse than 




Fig. 3. Left, population density by tract; middle, metro /nonmetro area; right, poverty 
rate by tract. 
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the bivariate residual model. This suggests that the two sets of residuals are 
fairly similar, and that p is close to 1. 

Table 3 shows parameter estimates from some of our models. We param- 
eterize so that the top rows concern the fixed effects for colon cancers, /3 1; 
but the second set of rows give the differential effect in the rectum cancer 
group, A = f3 2 — Pi- Thus, any 95% Bayesian confidence intervals that ex- 
clude in this part of the table suggest a variable that has a significantly 
different impact on the two cancers. 

In general, the effects of the non-spatial covariates are fairly similar across 
the models considered. We find that in the metro area there are relatively 
fewer cases of both colon and rectum cancer. This is consistent with statewide 
patterns of colorectal cancer occurrence in Minnesota, where higher age- 
adjusted rates are often found in nonmetro areas. However, there is no sig- 
nificant change in this relationship in the rectum group relative to the colon 
group. An interesting and somewhat counterintuitive finding is that poor 
areas seem to have relatively fewer cases. This appears consistent with the 
aforementioned finding of Wei et al. (2003) that colon cancer is associated 

Table 3 

Parameter estimates for the model with metro indicator and poverty rate as the spatial 
covariates, and stage and age as individual covariates. The estimates for rectum are 
relative effects to colon cancer. BSR = bivariate spatial residual model, USR = univariate 
spatial residual model, GLM= no random effects model 

Fitted model 



93 knots 









BSR 




USR 




GLM 




Colon intercept 


-8 


76 


(-9.12,-8.44) 


-8.75 


;-9.25,-8.40) - 


-8.91 


(-8.99, 


-8.83) 


metro 


-0 


23 


(-0.49,0.04) 


-0.19 


(-0.42,0.06) - 


-0.21 


(-0.29, 


-0.14) 


poverty 


-2 


01 


(-2.47,-1.55) 


-1.90 


;-2.36,-1.47) - 


-0.26 


(-0.61, 


0.09) 


age 





36 


(0.31,0.40) 


0.36 


(0.31,0.40) 


0.32 


(0.28,0. 


36) 


late 





48 


(0.42,0.54) 


0.48 


(0.42,0.54) 


0.48 


(0.43,0. 


54) 


metro*age 


-0 


06 


(-0.11,-0.02) 


-0.06 


(-0.11,-0.02) - 


-0.06 


(-0.11, 


-0.02) 


Rectum-colon intercept 


-0 


86 


(-1.08,-0.65) 


-0.84 


;-1.00,-0.68) - 


-0.84 


(-1.01, 


-0.69) 


metro 





02 


(-0.21,0.26) 


-0.07 


(-0.22,0.08) - 


-0.07 


(-0.22, 


0.09) 


poverty 





14 


(-0.70,0.98) 


-0.24 


(-1.06,0.52) - 


-0.22 


(-1.00, 


0.49) 


age 


-0 


18 


(-0.26,-0.10) 


-0.18 


(-0.26,-0.10) - 


-0.18 


(-0.25, 


-0.11) 


late 


-0 


26 


(-0.37,-0.15) 


-0.26 


(-0.38,-0.15) - 


-0.26 


(-0.37, 


-0.15) 


metro*age 





06 


(-0.03,0.15) 


0.05 


(-0.03,0.15) 


-0.01 


(-0.08, 


0.07) 


P 





98 


(0.95,0.99) 












4> 


195 




195 















95 


(0.57,1.48) 


0.76(0.43,1.33) 













75 


(0.41,1.33) 
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with foods often consumed by relatively more affluent people (beef, pork, or 
lamb as a main dish, and other processed meat). However, unlike these au- 
thors, we find no significant difference in this relationship for rectum cancer. 

Turning to the nonlocation-specific covariates, age is significantly associ- 
ated with increasing colon cancer, but a somewhat surprising relative de- 
crease in rectum cancer. This difference (—0.18) is statistically significant, 
but not large enough in magnitude to make the overall age effect in the rec- 
tum group negative. A look at the data bears this out, with rectum cancers 
arising in a somewhat younger population; our preliminary Poisson regres- 
sion also concurs, though here the relative decrease in the rectum group 
is not significant. Late detection provides another interesting difference be- 
tween the colon and rectum groups: while there are significantly more cases 
diagnosed late than early, the effect of late diagnosis is significantly smaller 
in the rectum group (point estimate —0.26). Thus, public health interven- 
tions to encourage screening and early detection of colorectal cancer will 
have significantly greater impact on prevention for colon than for rectum. 
The metro-age interaction shows that the effect of age on colon cancer is 
significantly less pronounced in the metro smaller "age adjustment" 

to the colon cancer intensity process is needed in the metro area. This effect 
is largely absent for rectum cancer, but this difference is not quite statis- 
tically significant. Finally, the estimate of p is very close to 1, indicating 
very similar spatial residual patterns. This is perhaps a surprisingly strong 
association, but believable given that these are residual surfaces, which ac- 
count (at least conceptually) for important missing covariates, which could 
be spatial (e.g., local screening percentage, other sociodemographic factors) 
or nonspatial (e.g., the physiological adjacency of the colon and the rectum). 

Our Bayesian viewpoint allows us to make probabilistic statements both 
against and in favor of various null hypotheses of interest. For example, 
suppose we take 20% as the minimum difference in relative intensity re- 
quired to conclude a practically meaningful difference between the colon and 
rectum cancer groups. For predictor i, this amounts to a test of Ho:Ai € 
[log(0.8),log(1.2)] versus H a :Ai<£ [log(0.8),log(1.2)]. The posteriors sum- 
marized in the second group of rows in Table 3 enable us to compute poste- 
rior odds ratios OR = P(H a \S)/ P(Ho\S) for any predictor of interest. In our 
dataset, the two predictors of greatest substantive interest yield different re- 
sults. Living in the metro area produces OR = 0.12, or odds of just over 8:1 
in favor of no real difference in the colon and rectum groups. However, for 
late detection we obtain OR = 2.81, or nearly 3:1 odds in favor of a practi- 
cally meaningful difference (in this relative intensity reduction in the 
rectum group). Again, this suggests a public health program encouraging 
more aggressive cancer screening would be sensibly targeted to those living 
in regions with higher colon cancer relative intensity, since this should lead 
to a more meaningful reduction in cancer prevalence. 
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Fig. 4. Log-relative intensity surfaces using values at centroid of each census tract at 
the mean age and assuming an early diagnosis. The top row is for colon cancer and the 
bottom for rectum cancer. The first column is the log-relative intensity surfaces without 
spatial residuals. The second column is the spatial residuals and the third is the complete 
log-relative intensity surfaces. 

Figure 4 shows maps of the fitted log intensity surfaces both without (left 
column) and with (right column) the spatial residuals (middle column), for 
a case at the mean age and diagnosed at an early stage. Without spatial 
residuals, the two spatial covariates alone predict slightly higher prevalence 
in the nonmetro areas. However, the residuals (which unlike our spatial co- 
variates are point-level, and are thus summarized using image-contour maps) 
indicate further reductions are needed in the near southern, southeastern, 
and northern suburbs, as well as the far north exurban area. This leads to 
the more mottled fitted patterns in the rightmost column. Note these final 
two maps in the right column result in spatially smoothed versions of the 
corresponding maps in the right column of Figure 2, which we recall are 
something like raw log-relative intensity surfaces. While direct comparison 
is not really possible since the maps in the right column of Figure 4 are 
adjusted for both spatial and nonspatial covariates, the overall similarities 
further confirm the good fit scores achieved by our models. From a practical 
point of view, when combined with the significant differences in age and late 
detection between the two cancers found in Table 3, our findings encourage 
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more aggressive colon cancer screening in the inner Twin Cities and the 
far southern and western exurbs, where the upper right panel of Figure 4 
indicates higher colon cancer relative intensity. 

4. Discussion. We have offered an analysis of colon and rectum cancer 
incidence data collected by the Minnesota Cancer Surveillance System dur- 
ing the period 1998-2002. In so doing we extended customary spatial point 
pattern analysis in the context of a log Gaussian Cox process model to ac- 
commodate covariates that are spatially referenced, individual-level cancer 
type marks, and individual-level risk factors that are not of interest in terms 
of marking. Our approach yields easy-to-interpret fixed effects for testing for 
equality of epidemiological properties across the two cancers, and fitted maps 
that can reflect the impacts of the spatially indexed covariates, spatial resid- 
uals, or both. These last maps also offer spatially smoothed fitted surfaces 
reminiscent of those in traditional areal models, but now adjusted for the 
nonspatially varying covariates, age and cancer stage. 

As with many observational data analyses, our findings raise as many 
questions as they answer. The somewhat counterintuitive negative relation- 
ship between tract-level poverty and colorectal cancer shown in Table 3 
might be the result of unmodeled confounding between age and poverty: poor 
areas could very well be significantly younger (especially in the metro, which 
features a higher proportion of immigrants, who tend to be younger). Since 
colorectal cancer is so highly associated with age, the apparent beneficial 
effect of poverty might just be another manifestation of the protective effect 
of youth. Similarly, modestly negative metro-age interaction may be due to 
more common use of colorectal screening in the metro area. Such screening 
methods can reduce colorectal cancer incidence by identifying pre-malignant 
lesions (polyps) and removing them; failure to screen a population might in- 
crease both the number of cases and the ages at which the cancers were 
diagnosed. Sadly, we currently lack the individual-level income and screen- 
ing information necessary to precisely address these questions. Moreover, 
the MCSS database also does not feature information on diet, exercise, or 
family histories of the patients, the three previously-identified factors most 
likely to be responsible for any differences between colon and rectum cancer 
relative hazards. The data collected by MCSS is determined by legislation, 
and to expand it in any way requires a change in Minnesota state law, at- 
tempts at which the Minnesota Department of Health prefers to keep as rare 
as possible. As a result, future research regarding epidemiological properties 
of colon and rectum cancer should perhaps focus on obtaining approval and 
funding for a follow-up questionnaire mailed to all MCSS patients. 

Even more fundamentally, an increasing number of authors view the de- 
bate over whether colon and rectum cancers have different etiologies as mis- 
placed, arguing that the real distinction is not colon versus rectum, but 
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rather proximal (right, or ascending) versus distal (left, or descending) colon, 
the latter of which includes the rectum. These authors argue that colorectal 
cancer is not a single disease, but two distinct diseases with distinct molec- 
ular profiles. One of these is more commonly found in the distal colon, and 
derives from hyperplastic polyps, whose putative successor lesions, serrated 
adenomas, represent discrete steps along a pathway to cancer [Huang et al. 
(2004)]. By contrast, the cancers most common in the proximal colon arise 
from an entirely different molecular pathway [O'Brien et al. (2006)]. Dif- 
ferences in risk factors for these two pathways are not well established but 
are nonetheless entirely likely. Relatedly, Glebov et al. (2003) found more 
than 1000 genes expressed differentially in adult ascending versus descend- 
ing colon. Thus, the real sub classification of interest may not be colon versus 
rectum or distal colon versus proximal colon, but rather molecular pathol- 
ogy. Of course, this line of thinking encourages a reporting of cancers that 
is well beyond the capabilities of most U.S. public health reporting (hence 
intervention) systems, but the idea bears watching. 

On the brighter side, recent audits have suggested our MCSS dataset is 
over 99% complete; that is, due to state reporting requirements, we are aware 
of essentially every tumor discovered by doctors in Minnesota. However, 
our methods obviously cannot reflect tumors that are not discovered or 
otherwise not reported. To the extent that such tumors happen unevenly 
across the spatial domain, this could lead to bias in our fitted estimates and 
maps. We do not think differential underreporting is a problem across our 
current, relatively compact and relatively urban 16-county spatial domain, 
but datasets that reached further into more remote regions of the state 
(especially semi-autonomous Native American tribal lands) may well suffer 
from this problem. 

Future work in this area also includes extending our model with more 
complex interaction terms and perhaps more than two marks (the MCSS 
database has information on more than 20 cancers), leading to more chal- 
lenging model-fitting. Another issue to address is the imprecision in the (typ- 
ically rural) addresses within the point pattern. In some cases error may be 
simply due to the sensing device (e.g., the GPS unit), while in others it may 
be due to the practical limits of geocoding: for some of the cancers in our 
MCSS data, a significant proportion of the geocodes may be based on less 
than a complete and valid street address (e.g, residence zip + 2, residence 
zip only, or even the zip of a post office box). A final, perhaps most interest- 
ing path for the future lies in space-time point pattern analysis, in order to 
see evolution of cancer intensities over time. In the case of continuous time, 
we would now add a time argument to our intensity functions, leading to 
substantially increased scope for the modeling (e.g., separable versus non- 
separable models for the space-time intensity) . If time is instead viewed as 
discrete, we might instead extend our framework to temporally dynamic log 
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Gaussian Cox process models. Both of these options, while computationally 
challenging, could pay significant practical dividends. 
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SUPPLEMENTARY MATERIAL 

Computational Issues (DOI: 10. 1214/09- AOAS240SUPP; .pdf). We pro- 
vide full details of the Monte Carlo algorithms needed to approximate the 
complex point process likelihoods in the paper. In particular, we flesh out the 
details of our knot-based predictive process approximation, and give general 
guidelines for how the knots should be selected in any given application. 
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